R Labs for AREC 513 - Econometric Applications
Section

  • 1 Data Manipulation using dplyr
    • 1.1 What is a tidy data set?
    • 1.2 Create a farm business data set
    • 1.3 Important Functions in dplyr
      • 1.3.1 select()
      • 1.3.2 filter()
      • 1.3.3 arrange()
      • 1.3.4 mutate()
      • 1.3.5 summarize()
      • 1.3.6 group_by() and %>%
      • 1.3.7 Other Functions/Verbs
    • 1.4 Export and Import Data
      • 1.4.1 RData Format
      • 1.4.2 csv Format
      • 1.4.3 Other Format
    • 1.5 Useful Resources
      • 1.5.1 dplyr Cheat Sheet
      • 1.5.2 R for Data Science
  • 2 Data Visualization using ggplot
    • 2.1 Introduction to ggplot2
    • 2.2 Layered Grammar of Graphics
    • 2.3 Layers in ggplot2
      • 2.3.1 Geometric Layers
      • 2.3.2 Facets
      • 2.3.3 Scales
      • 2.3.4 Legends
      • 2.3.5 Themes
    • 2.4 Example
    • 2.5 Useful resources

Lab 2: Data Manipulation and Visualization

Author: Feng Qiu, Liyuan Xuan

Published: September 22, 2025

In Lab 1, we briefly introduced R packages and one package in particular, tidyverse. If you wish to learn more about tidyverse, click here for more information. Lab 2 focuses on two packages that are included in tidyverse:

  1. dplyr for data manipulation

  2. ggplot2 for data visualization

But first, remember to load the package.

# install.packages("tidyverse")     # install if needed
library(tidyverse)

1 Data Manipulation using dplyr

1.1 What is a tidy data set?

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. Three rules make a dataset tidy:

  1. Each variable must have its own column

  2. Each observation must have its own row

  3. Each value must have its own cell
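To make the three rules concrete, here is a minimal sketch that reshapes a small "wide" table into tidy form with pivot_longer() from tidyr (loaded as part of tidyverse). The table returns_wide, its column names and its numbers are made up for illustration only and are not part of the lab data.

```r
library(tidyr)

# hypothetical messy table: the variable "year" is hidden in the column names
returns_wide <- data.frame(
  farmer = c("A", "B"),
  y2023  = c(38, 85),
  y2024  = c(40, 90)
)

# tidy version: one column per variable (farmer, year, return),
# one row per observation (a farmer in a given year)
returns_tidy <- pivot_longer(
  returns_wide,
  cols      = c(y2023, y2024),
  names_to  = "year",
  values_to = "return"
)
returns_tidy
```

After reshaping, each farmer-year combination sits in its own row, which is the layout that the dplyr and ggplot2 functions in this lab expect.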

1.2 Create a farm business data set

# farmers' info
name <- c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby")
sex <- c("male", "male", "male", "female", "female", "female")
age <- c(43, 60, 25, 50, 28, 58)

# types of farm
type <- c("crop", "livestock", "urban", "dairy", "crop", "livestock")

# size of farm in acres
size <- c(550, 800, 10, 600, 1000, 700)

# net annual cash return from ag businesses, in $1000
return <- c(40, 90, 50, 90, 90, 95)

# combine the variables together as a data frame
farm <- data.frame(name, age, sex, type, size, return)
farm
name age sex type size return
Henry 43 male crop 550 40
Larry 60 male livestock 800 90
Alex 25 male urban 10 50
Gaby 50 female dairy 600 90
Amy 28 female crop 1000 90
Ruby 58 female livestock 700 95
# glimpse the data set
glimpse(farm)
Rows: 6
Columns: 6
$ name   <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age    <dbl> 43, 60, 25, 50, 28, 58
$ sex    <chr> "male", "male", "male", "female", "female", "female"
$ type   <chr> "crop", "livestock", "urban", "dairy", "crop", "livestock"
$ size   <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95

1.3 Important Functions in dplyr

The six most important functions in dplyr are:

  1. select(): pick variables by their names

  2. filter(): pick observations by their values

  3. arrange(): reorder the rows

  4. mutate(): create new variables with functions of existing variables

  5. summarize(): collapse many values down to a single summary

  6. group_by(): groups data by one or more variables, allowing subsequent operations to be applied independently to each group

Combined with the pipe operator %>%, these functions make data manipulation simple and intuitive.

Tip

You can always type “?FUNCTION_NAME” in the Console pane to check the R Documentation for the function. Try ?select.

1.3.1 select()

select() allows you to focus on the variables you’re interested in.

select(farm, c(type, size, return))     # select farm type, size and return
type size return
crop 550 40
livestock 800 90
urban 10 50
dairy 600 90
crop 1000 90
livestock 700 95
select(farm, sex:size)      # select everything between sex and size
sex type size
male crop 550
male livestock 800
male urban 10
female dairy 600
female crop 1000
female livestock 700
select(farm, -name)     # select everything but names
age sex type size return
43 male crop 550 40
60 male livestock 800 90
25 male urban 10 50
50 female dairy 600 90
28 female crop 1000 90
58 female livestock 700 95
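Besides listing columns by name, select() also accepts helper functions. The sketch below shows three common patterns; the object names (num_only, s_cols, renamed) are only illustrative, and the farm data is re-created so that the snippet runs on its own.

```r
library(dplyr)

# re-create the farm data from Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

num_only <- select(farm, where(is.numeric))    # keep numeric columns: age, size, return
s_cols   <- select(farm, starts_with("s"))     # keep columns whose names start with "s": sex, size
renamed  <- select(farm, farmer = name, age)   # select and rename in one step
```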

1.3.2 filter()

filter() allows you to subset observations based on their values.

filter(farm, size > 500)      # select farms with size > 500
name age sex type size return
Henry 43 male crop 550 40
Larry 60 male livestock 800 90
Gaby 50 female dairy 600 90
Amy 28 female crop 1000 90
Ruby 58 female livestock 700 95
filter(farm, size > 500 & sex == "female")      # select farms with size > 500 AND owned by female farmers
name age sex type size return
Gaby 50 female dairy 600 90
Amy 28 female crop 1000 90
Ruby 58 female livestock 700 95
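Conditions in filter() can also be combined with OR (|), with %in% for membership in a set, or simply separated by commas, which filter() treats as AND. A short sketch, with the farm data re-created so that it runs on its own (the result names are illustrative):

```r
library(dplyr)

# re-create the farm data from Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

crop_or_dairy <- filter(farm, type %in% c("crop", "dairy"))   # Henry, Gaby, Amy
young_or_top  <- filter(farm, age < 30 | return >= 95)        # Alex, Amy, Ruby
mid_size      <- filter(farm, size >= 500, size <= 800)       # comma means AND: Henry, Larry, Gaby, Ruby
```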

1.3.3 arrange()

arrange() orders the observations by one or more variables. Basically, it changes the order of rows.

arrange(farm, size)      # order the data set by farm size, by default, in ascending order
name age sex type size return
Alex 25 male urban 10 50
Henry 43 male crop 550 40
Gaby 50 female dairy 600 90
Ruby 58 female livestock 700 95
Larry 60 male livestock 800 90
Amy 28 female crop 1000 90
arrange(farm, desc(size))      # change the ordering to descending
name age sex type size return
Amy 28 female crop 1000 90
Larry 60 male livestock 800 90
Ruby 58 female livestock 700 95
Gaby 50 female dairy 600 90
Henry 43 male crop 550 40
Alex 25 male urban 10 50
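When several rows share the same value, additional variables passed to arrange() act as tie-breakers. The sketch below sorts by return (descending) and then by size (ascending) among farms with equal returns; the farm data is re-created so that the snippet runs on its own.

```r
library(dplyr)

# re-create the farm data from Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# highest return first; the three farms with return = 90 are ordered by size
sorted <- arrange(farm, desc(return), size)
sorted$name
# "Ruby" "Gaby" "Larry" "Amy" "Alex" "Henry"
```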

1.3.4 mutate()

mutate() modifies existing variables or adds new variables.

mutate(farm, return = return * 1000)
name age sex type size return
Henry 43 male crop 550 40000
Larry 60 male livestock 800 90000
Alex 25 male urban 10 50000
Gaby 50 female dairy 600 90000
Amy 28 female crop 1000 90000
Ruby 58 female livestock 700 95000
mutate(farm, age.sq = age ^ 2)
name age sex type size return age.sq
Henry 43 male crop 550 40 1849
Larry 60 male livestock 800 90 3600
Alex 25 male urban 10 50 625
Gaby 50 female dairy 600 90 2500
Amy 28 female crop 1000 90 784
Ruby 58 female livestock 700 95 3364
mutate(farm, per.acre.return = return / size)
name age sex type size return per.acre.return
Henry 43 male crop 550 40 0.0727273
Larry 60 male livestock 800 90 0.1125000
Alex 25 male urban 10 50 5.0000000
Gaby 50 female dairy 600 90 0.1500000
Amy 28 female crop 1000 90 0.0900000
Ruby 58 female livestock 700 95 0.1357143
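# Or, you can do all three in one step. mutate() evaluates its arguments in order,
# so per.acre.return below uses the return already multiplied by 1000 (dollars per acre).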
mutate(farm,
  return = return * 1000, 
  age.sq = age ^ 2,
  per.acre.return = return / size
)
name age sex type size return age.sq per.acre.return
Henry 43 male crop 550 40000 1849 72.72727
Larry 60 male livestock 800 90000 3600 112.50000
Alex 25 male urban 10 50000 625 5000.00000
Gaby 50 female dairy 600 90000 2500 150.00000
Amy 28 female crop 1000 90000 784 90.00000
Ruby 58 female livestock 700 95000 3364 135.71429
# change the classes of variables
glimpse(farm)      # view the data before changes
Rows: 6
Columns: 6
$ name   <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age    <dbl> 43, 60, 25, 50, 28, 58
$ sex    <chr> "male", "male", "male", "female", "female", "female"
$ type   <chr> "crop", "livestock", "urban", "dairy", "crop", "livestock"
$ size   <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95
farm2 <- mutate(farm,
                sex = as.factor(sex), 
                type = as.factor(type), 
                age = as.integer(age)
         )
glimpse(farm2)       # view the data after changes
Rows: 6
Columns: 6
$ name   <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age    <int> 43, 60, 25, 50, 28, 58
$ sex    <fct> male, male, male, female, female, female
$ type   <fct> crop, livestock, urban, dairy, crop, livestock
$ size   <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95

The function ifelse() is often used in data manipulation. It assigns values to a variable based on whether a condition is satisfied.

mutate(farm,
       size2 = ifelse(size > 600, "big", "small"),   
       dummy_urban = ifelse(type == "urban", 1, 0)      # when testing for equality, use double ==
)
name age sex type size return size2 dummy_urban
Henry 43 male crop 550 40 small 0
Larry 60 male livestock 800 90 big 0
Alex 25 male urban 10 50 small 1
Gaby 50 female dairy 600 90 small 0
Amy 28 female crop 1000 90 big 0
Ruby 58 female livestock 700 95 big 0
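When there are more than two categories, nested ifelse() calls become hard to read. One possible alternative is dplyr's case_when(), which checks the conditions in order and uses the first one that is TRUE. The sketch below applies it to age; the cut-offs (30 and 55) and the labels are arbitrary choices made for illustration.

```r
library(dplyr)

# re-create the farm data from Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

farm_age <- mutate(farm,
  age.group = case_when(
    age < 30 ~ "young",      # checked first
    age < 55 ~ "middle",     # reached only if age >= 30
    TRUE     ~ "senior"      # everything else
  ),
  # store as a factor with a meaningful level order
  age.group = factor(age.group, levels = c("young", "middle", "senior"))
)
farm_age
```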
Exercise

Generate a new variable called size3 that meets the following criterion:

  • size3 = “small” if size <= 200
  • size3 = “median” if 200 < size <= 600
  • size3 = “big” if size > 600

Finally, convert size3 to a factor variable.

1.3.5 summarize()

summarize() computes summary statistics. It always returns a single row if there are no grouping variables.

summarize(farm, tot.return = sum(return))
tot.return
455
summarize(farm, avg.return = mean(return))
avg.return
75.83333
summarize(farm,
          youngest = min(age),
          oldest = max(age),
          median = median(age),
          cor.size.return = cor(size, return))
youngest oldest median cor.size.return
25 60 46.5 0.6787267
Tip

We often wish to know the summary statistics for each group, e.g. the average return by gender. Therefore, summarize() is usually combined with group_by() and the pipe operator %>%.

1.3.6 group_by() and %>%

1.3.6.1 group_by()

group_by() groups data by the named variables. On its own, group_by() changes neither the values nor the order of the rows (unlike arrange()); it only records the grouping, so the printed data look the same as before.

group_by(farm, sex)
name age sex type size return
Henry 43 male crop 550 40
Larry 60 male livestock 800 90
Alex 25 male urban 10 50
Gaby 50 female dairy 600 90
Amy 28 female crop 1000 90
Ruby 58 female livestock 700 95
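One way to see what group_by() does on its own is to inspect the grouped object. In the sketch below, group_vars() reports the grouping variable, and a grouped mutate() shows that later operations are then carried out within each group. The column name avg.return.sex is an illustrative choice.

```r
library(dplyr)

# re-create the farm data from Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

farm_by_sex <- group_by(farm, sex)
group_vars(farm_by_sex)                # "sex": the grouping is stored with the data
all(farm_by_sex$name == farm$name)     # TRUE: the rows keep their original order

# a grouped mutate() computes the mean within each group, then ungroup() drops the grouping
farm_grp <- farm %>%
  group_by(sex) %>%
  mutate(avg.return.sex = mean(return)) %>%
  ungroup()
farm_grp
```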
1.3.6.2 %>%

However, the main purpose of group_by() is to set up the grouping for the operations that follow. To chain those operations conveniently, you will also want the pipe operator %>%. Like a pipe, %>% passes the output of one function to the next function as its input.

Suppose you wish to perform the steps below, based on the data farm:

  1. calculate the return per acre, called “per.acre.return”
  2. keep only the farms that are owned by farmers above 40 years old
  3. create a new data frame that contains only the names and ages of the farmers and the return per acre
### without %>%
farm_wo_pipe1 <- mutate(farm, per.acre.return = return / size)
farm_wo_pipe2 <- filter(farm_wo_pipe1, age > 40)
farm_wo_pipe3 <- select(farm_wo_pipe2, c(name, age, per.acre.return))
farm_wo_pipe3
name age per.acre.return
Henry 43 0.0727273
Larry 60 0.1125000
Gaby 50 0.1500000
Ruby 58 0.1357143
### with %>%

farm_w_pipe <- farm %>% mutate(per.acre.return = return / size) %>%
                        filter(age > 40) %>%
                        select(name, age, per.acre.return)
farm_w_pipe
name age per.acre.return
Henry 43 0.0727273
Larry 60 0.1125000
Gaby 50 0.1500000
Ruby 58 0.1357143
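In general, x %>% f(y) is just another way of writing f(x, y). The sketch below writes the same three steps as nested calls and as a pipeline and checks that the results are identical. It also uses the native pipe |>, which is built into R from version 4.1 onwards and behaves the same way in simple chains like this one.

```r
library(dplyr)

# re-create the farm data from Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# nested calls: read from the inside out
nested <- select(
  filter(mutate(farm, per.acre.return = return / size), age > 40),
  name, age, per.acre.return
)

# the same steps with %>%: read from top to bottom
piped <- farm %>%
  mutate(per.acre.return = return / size) %>%
  filter(age > 40) %>%
  select(name, age, per.acre.return)

# the same steps with the native pipe (requires R >= 4.1)
native <- farm |>
  mutate(per.acre.return = return / size) |>
  filter(age > 40) |>
  select(name, age, per.acre.return)

identical(nested, piped)     # TRUE
```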
1.3.6.3 Combining group_by() with %>%

Now, let’s calculate summary statistics by groups, using group_by() with %>%.

farm %>% group_by(sex) %>% summarize(num.farmer = n(),
                                     youngest = min(age),
                                     oldest = max(age),  
                                     
                                     tot.return = sum(return),
                                     avg.return = mean(return),
                                     avg.per.acre.return = mean(return/size),
                                     avg.size = mean(size))
sex num.farmer youngest oldest tot.return avg.return avg.per.acre.return avg.size
female 3 28 58 275 91.66667 0.1252381 766.6667
male 3 25 60 180 60.00000 1.7284091 453.3333
Exercise

Generate the following summary statistics, for each type of the farms:

  1. the sum of all returns, called tot.return
  2. the average returns, called avg.return

Finally, rearrange the data based on the value of avg.return, in the descending order.

1.3.7 Other Functions/Verbs

1.3.7.1 slice() and Its Variants

You can use slice() to select rows by position, or one of its variants:

  • slice_head() and slice_tail(): to select first/last rows
  • slice_min() and slice_max(): to select rows with minimum/maximum values
  • slice_sample(): to select random samples
farm
name age sex type size return
Henry 43 male crop 550 40
Larry 60 male livestock 800 90
Alex 25 male urban 10 50
Gaby 50 female dairy 600 90
Amy 28 female crop 1000 90
Ruby 58 female livestock 700 95
farm %>% slice(3)     # pick the observation in row 3
name age sex type size return
Alex 25 male urban 10 50
farm %>% slice(1:3)     # pick observations from row 1 through row 3
name age sex type size return
Henry 43 male crop 550 40
Larry 60 male livestock 800 90
Alex 25 male urban 10 50
farm %>% slice_head(n = 3)      # pick first 3 rows, slice_tail would pick the last 3 rows
name age sex type size return
Henry 43 male crop 550 40
Larry 60 male livestock 800 90
Alex 25 male urban 10 50
farm %>% slice_min(age, n = 3)      # pick 3 rows with the youngest ages, slice_max would pick 3 rows with the largest ages
name age sex type size return
Alex 25 male urban 10 50
Amy 28 female crop 1000 90
Henry 43 male crop 550 40
farm %>% slice_sample(n = 3)      # randomly pick 3 observations 
name age sex type size return
Ruby 58 female livestock 700 95
Gaby 50 female dairy 600 90
Amy 28 female crop 1000 90
farm %>% slice_sample(prop = 0.5)     # randomly pick 50% of the data 
name age sex type size return
Ruby 58 female livestock 700 95
Amy 28 female crop 1000 90
Henry 43 male crop 550 40
1.3.7.2 count()

count() counts the number of observations for each category.

count(farm)     # count the number of observations
n
6
count(farm, type)     # count observations per type of farm
type n
crop 2
dairy 1
livestock 2
urban 1
count(farm, type, sort = TRUE)     # sort = TRUE orders the result by n, largest first
type n
crop 2
livestock 2
dairy 1
urban 1
count(farm, type, wt = return, sort = TRUE)     # add argument for weight
type n
livestock 185
crop 130
dairy 90
urban 50
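count() can be thought of as a shortcut for group_by() followed by summarize(). The sketch below reproduces both the plain and the weighted count by hand; the farm data is re-created so that the snippet runs on its own, and the object names are illustrative.

```r
library(dplyr)

# re-create the farm data from Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# number of farms per type, two ways
n_count <- count(farm, type)
n_hand  <- farm %>% group_by(type) %>% summarize(n = n())

# with wt = return, n is the sum of return within each type, two ways
wt_count <- count(farm, type, wt = return)
wt_hand  <- farm %>% group_by(type) %>% summarize(n = sum(return))
```

In both pairs, the n columns agree; count() simply saves some typing.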

1.4 Export and Import Data

This section introduces base R functions for exporting your data for later use and for importing saved data. To learn more about importing and exporting data, check out this link.

1.4.1 RData Format

### export 
save(farm, file = "farm.Rdata")     # save to the current working directory
# specify the file path if you wish to save to a different location

### import
load("farm.Rdata")      # load from the current working directory
# specify the file path if your file is stored in a different location

1.4.2 csv Format

### export
write.csv(farm, "farm.csv", row.names = FALSE)     # row.names = FALSE avoids an extra row-number column on re-import

### import
farm <- read.csv("farm.csv")
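One detail worth knowing: by default, write.csv() also writes the row names, which come back as an extra column named X when the file is read again. The sketch below shows the difference using temporary files, so nothing is written to your working directory; the variable names are illustrative.

```r
# re-create the farm data from Section 1.2
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

path_default <- tempfile(fileext = ".csv")
path_clean   <- tempfile(fileext = ".csv")

write.csv(farm, path_default)                       # default: row names are written too
write.csv(farm, path_clean, row.names = FALSE)      # only the six variables are written

with_rownames <- read.csv(path_default)
without_rownames <- read.csv(path_clean)

names(with_rownames)       # starts with "X", the saved row names
names(without_rownames)    # same six columns as farm
```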

1.4.3 Other Format

If you are working with SPSS, Stata or SAS data files, haven is a good package for importing and exporting files of those formats.

Tip

A handy trick for importing data interactively, without specifying a path, is read.csv(file.choose()).

1.5 Useful Resources

1.5.1 dplyr Cheat Sheet

Click here for more information

1.5.2 R for Data Science

See Chapter 5 of R for Data Science, by Wickham, H., & Grolemund, G.

2 Data Visualization using ggplot

library(tidyverse) 
library(gapminder) # for additional data
library(patchwork) # optional, used to show graphs side by side

2.1 Introduction to ggplot2

ggplot2 is a plotting package that provides powerful commands to create graphs from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication-quality plots with minimal adjustment and tweaking. Reference.

  • The “gg” here refers to “grammar of graphics”.
  • Every graph consists of one or more geometric layers.

For demonstration, we will be using mpg, a data set that ships with ggplot2.

data(mpg)     
head(mpg)     
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact

2.2 Layered Grammar of Graphics

For our illustration of functions in ggplot2 in Lab 2, the layered grammar of graphics follows the template below. We will go through the components one by one in the following sections.

 ggplot(data = <DATA>) + 
     <GEOM_FUNCTION>(
         mapping = aes(<MAPPINGS>)) +
     <FACET_FUNCTION> +
     <SCALE_FUNCTION> +
     <LABS_FUNCTION> +
     <THEME_FUNCTION>
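As a rough illustration of how the template can be filled in, the sketch below uses the mpg data with one function per slot. The particular choices (facet_wrap() by drv, axis breaks from 2 to 7, theme_minimal(), and the label text) are only examples and not the only way to complete the template.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +      # <GEOM_FUNCTION> with <MAPPINGS>
  facet_wrap(~ drv) +                                  # <FACET_FUNCTION>: one panel per drive type
  scale_x_continuous(breaks = 2:7) +                   # <SCALE_FUNCTION>: tick marks on the x-axis
  labs(x = "Engine displacement (litres)",             # <LABS_FUNCTION>: axis labels and title
       y = "Highway miles per gallon",
       title = "Fuel economy by engine size") +
  theme_minimal()                                      # <THEME_FUNCTION>: overall look
p
```

Each line after the first adds one layer or setting with +, so any slot can be dropped or swapped without touching the others.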

2.3 Layers in ggplot2

2.3.1 Geometric Layers

2.3.1.1 Commonly Used geom Functions
  1. geom_point(): to create scatterplots

  2. geom_line(): to create line plots

  3. geom_bar(): to create bar charts of counts

  4. geom_col(): to create bar charts of values

  5. geom_boxplot(): to show distributions and outliers with boxplots

  6. geom_smooth(): to add a fitted trend line

  7. geom_jitter(): to aid the visualization of points by adding “jitters” to the locations of points

# create a scatter plot  
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) 

# add another layer
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = "red") +      # you can also request a specific color
    geom_smooth(mapping = aes(x = displ, y = hwy))

The geom_xxx() functions can inherit both the data and the aesthetic mapping from the top level of the plot: a layer uses the plot's data unless it is given its own, and inherit.aes = TRUE by default (as specified in the R Documentation). As a result, you can simplify your code as

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(color = "red") +
    geom_smooth()
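Inherited mappings can still be extended within a single layer. In the sketch below, x and y are set once at the top level, and color = drv is added for geom_point() only, so the points are colored by drive type while geom_smooth() fits a single line to all the data. The choice of method = "lm" and se = FALSE is only to keep the example simple.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +   # x and y are inherited by both layers
  geom_point(mapping = aes(color = drv)) +                     # extra mapping used by this layer only
  geom_smooth(method = "lm", se = FALSE)                       # one straight line for the whole data set
p
```

If color = drv were placed in the top-level aes() instead, geom_smooth() would inherit it as well and draw one line per drive type.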

2.3.1.2 Aesthetic Mapping

Recall our previous code, in which geom_point() received both a mapping inside aes() and a fixed color = "red". Aesthetics in a geom_xxx() statement can be specified in two ways:

  1. inside the aes() function, which maps variables to aesthetics, in order to represent or enhance the visual features.

  2. outside the aes() function, which takes fixed values. This step is usually optional.

geom_xxx(aes(ARGUMENTS = variable, ...), ARGUMENTS = fixed values). Some commonly used aesthetics are:

  • x, y: define the variables to be put on the x-axis and y-axis. These have to be defined inside the aes() function.

  • color: defines the colors used to draw lines and strokes.

  • fill: defines the colors used inside areas of geoms.

  • shape: defines the symbols of points.

  • size: defines the size of points.

  • alpha: defines the opacity of geoms.

The examples below show the difference between mapping a variable to an aesthetic and setting an aesthetic to a fixed value.

p1 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, color = drv)) # map variable drv to color 
  
p2 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), color = "red") # color is now set to a fixed value 

p1 + p2 # enabled by "patchwork"              

p3 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, shape = drv))  # map variable drv to shape 
  
p4 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), shape = 2)  # shape is now set to a fixed value 

p3 + p4               

p5 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, size = drv))  # map variable drv to size (ggplot2 warns that size is not advised for a discrete variable) 
  
p6 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), size = 3)  # size is now set to a fixed value 

p5 + p6               

p7 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, alpha = drv))  # map variable drv to alpha (ggplot2 warns that alpha is not advised for a discrete variable) 
  
p8 <- ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy), alpha = 0.1)  # alpha is now set to a fixed value 

p7 + p8               

p9 <- ggplot(data = mpg) +
        geom_bar(mapping = aes(x = class, fill = drv))  # map variable drv to fill
  
p10 <- ggplot(data = mpg) +
       geom_bar(mapping = aes(x = class), fill = "red")  # fill is now set to a fixed value 

p9 + p10               
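A common slip is to put a fixed value inside aes(). In that case ggplot2 treats the value as a variable with a single category: the points are drawn in the default color and a legend labelled "blue" appears. The sketch below contrasts the two versions; the object names are illustrative.

```r
library(ggplot2)

# fixed value outside aes(): all points are drawn in blue, no legend
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# fixed value inside aes(): "blue" is treated as a one-category variable,
# so the points are NOT blue and a legend with the entry "blue" is added
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

p_fixed
p_mapped
```

If a plot shows an unexpected legend or an unexpected color, checking whether a fixed value ended up inside aes() is a good first step.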

2.3.1.3 Commonly Used Fixed Values

2.3.2 Facets

2.3.3 Scales

2.3.4 Legends

2.3.5 Themes

2.4 Example

2.5 Useful resources
